SIDEKICK: TO DO - Do This Last
In this project we performed a segmentation and profiling analysis using the following step-wise process:

- HighRetention was used as the target feature.
- A decision tree algorithm was employed to segment the feature space into 8 segments, corresponding to the leaves of the decision tree.
- The best-fitting “pruned” tree was selected, for an optimal balance between relative error and complexity.
- The decision rules produced by the tree algorithm were then applied to the data to create the corresponding segments.

The original dataset consisted of 5000 customers and 60 customer features. During the preprocessing phase, these features were subjected to standard checks, cleaned, and prepared for modeling.
The following preprocessing steps are worthy of note:

- Missing values of DataLastMonth, DataOverTenure, EquipmentLastMonth, EquipmentOverTenure, and VoiceOverTenure were set to 0. Missing values of other variables were imputed with the standard strategies of using the mode for categorical features and the mean for numerical features.
- Internet and HomeOwner were recoded as ordinal integer features with values 1,…,5 and 0, 1, respectively. The integer ordinal encoding was chosen for Internet to correct for inconsistencies in labelling, and is intended to reflect the tiers of internet service offered by the company (a sketch of this recoding is given after this list).
- As a useful conceptual strategy to aid exploration and selection, given the large number of features, the features were grouped into categories such as demographic, behavioral, and financial.
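For concreteness, here is a minimal sketch of the recoding step in R. The data frame name (customers) and the Internet tier labels are illustrative assumptions, not the actual values in the dataset:

# Hypothetical sketch of the ordinal recoding; tier labels are assumed, not actual
internet_tiers <- c("None", "DSL", "Cable", "Fiber", "Fiber Plus")   # assumed labels
customers$Internet  <- as.integer(factor(customers$Internet, levels = internet_tiers))  # 1,...,5
customers$HomeOwner <- ifelse(customers$HomeOwner == "Yes", 1L, 0L)                     # 0/1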
New derivative features were then created, and preexisting features transformed. Of particular interest was the preexisting feature PhoneCoTenure, which indicates the number of months a customer has been with the company. We used this feature to identify long-term vs. short-term, i.e. high- vs. low-retention customers by creating a new feature HighRetention, indicating which customers had a tenure greater than the 75th percentile, namely 59 months.
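As a rough sketch (assuming the data frame is called customers), this flag can be computed directly from the tenure percentile:

# High retention = tenure above the 75th percentile (about 59 months)
tenure_cutoff <- quantile(customers$PhoneCoTenure, 0.75, na.rm = TRUE)
customers$HighRetention <- as.integer(customers$PhoneCoTenure > tenure_cutoff)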
SIDEKICK: TO DO - PhoneCoTenure distribution
goes here
The following five new derivative features were also created:

- TotalDebt = CreditDebt + OtherDebt: Total customer debt.
- AvgCardSpendMonth = CardSpendMonth/CardItemsMonthly: Average monthly credit card spending per item. Set to 0 if CardItemsMonthly == 0.
- AvgValuePerCar = CarValue/CarsOwned: Average value per car owned. Set to 0 if CarsOwned == 0.
- TechOwnership = OwnsFax + OwnsGameSystem + OwnsMobileDevice + OwnsPC: Number of technological items owned out of 4 possible (here the binary ownership features are 0/1 encoded).
- NumAddOns = Multiline + Pager + ThreeWayCalling + VM: Number of account add-ons out of 4 possible (again the binary add-on features are 0/1 encoded).

In addition, due to highly skewed distributions, the features CardSpendMonth, DataOverTenure, VoiceOverTenure, HHIncome, and CarValue were transformed to standardized versions, i.e. they were transformed by subtracting the mean and dividing by the standard deviation. Although these standardized versions were not explicitly used during the segmentation process, they were effectively used, as all segmentation features were numeric and standardized.
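The following sketch illustrates how the derived features and the standardization could be computed in R; the data frame name customers is an assumption, and the actual code used may differ:

# Derived features
customers <- within(customers, {
  TotalDebt         <- CreditDebt + OtherDebt
  AvgCardSpendMonth <- ifelse(CardItemsMonthly == 0, 0, CardSpendMonth / CardItemsMonthly)
  AvgValuePerCar    <- ifelse(CarsOwned == 0, 0, CarValue / CarsOwned)
  TechOwnership     <- OwnsFax + OwnsGameSystem + OwnsMobileDevice + OwnsPC
  NumAddOns         <- Multiline + Pager + ThreeWayCalling + VM
})

# Standardize the skewed features (subtract mean, divide by standard deviation)
skewed_feats <- c("CardSpendMonth", "DataOverTenure", "VoiceOverTenure", "HHIncome", "CarValue")
customers[skewed_feats] <- scale(customers[skewed_feats])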
It made sense to treat certain missing values as zero. In particular, we chose to do this for DataLastMonth, DataOverTenure, EquipmentLastMonth, EquipmentOverTenure, and VoiceOverTenure, since presumably the company has access to this information, and if it is not present, we assumed it can be treated as 0.
Finally, for the remaining features, we used a standard imputation strategy: the mode for categorical features and the mean for numerical features.
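A minimal sketch of this two-part imputation, again assuming a data frame named customers:

# Treat missing usage values as 0
usage_feats <- c("DataLastMonth", "DataOverTenure", "EquipmentLastMonth",
                 "EquipmentOverTenure", "VoiceOverTenure")
customers[usage_feats] <- lapply(customers[usage_feats], function(x) ifelse(is.na(x), 0, x))

# Mode for categorical features, mean for numerical features
impute_one <- function(x) {
  if (is.numeric(x)) x[is.na(x)] <- mean(x, na.rm = TRUE)
  else x[is.na(x)] <- names(which.max(table(x)))
  x
}
customers[] <- lapply(customers, impute_one)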
The preexisting and engineered features were used for exploratory data analysis, in which a subset of useful customer features was identified for use in segmentation and profiling.
Overall, the ability to provide useful and potentially novel insights through segmentation and profiling was the main criterion in feature selection. In particular, it was surmised that stakeholders and decision makers may be interested in identifying customers that may end up having a long tenure but that currently do not.
For that reason, certain variables with potentially useful information about internal customer behavior over a long tenure were omitted, namely DataOverTenure, EquipmentOverTenure, and VoiceOverTenure, that is, features whose future values over a long tenure would currently be unknown.
Moreover, other time-dependent features such as Age and Employment were identified as potentially confounding, that is, features with strong associations to long tenure (and thus to the derivative retention feature), and were also omitted.
Finally, the specific interest in using k-means segmenting as an unsupervised method, due to its ability to detect unknown patterns, was the next most important criterion, and had a large impact on the choice of features. Primarily, it resulted in a choice of purely numeric features for the segmentation process. Then, using the segments thus constructed, categorical feature characteristics for high and low retention customers were identified and discussed.
With these criteria in mind, the following hand-picked selection of fifteen customer demographic, behavioral, and financial features was used in our segmentation:
# customized set segmentation targets and features
seg_target <- "HighRetention"
seg_feats <- c("CommuteTime", "HouseholdSize", "TownSize", "CardItemsMonthly",
"DebtToIncomeRatio", "HHIncome", "CarsOwned",
"TVWatchingHours", "Region", "TotalDebt",
"CardSpendMonth", "HHIncome", "CarValue", "TechOwnership", "NumAddOns")
The aim of supervised learning is, in general, to find meaningful associations between the features and the target. In this case, we employed a supervised-learning-based segmentation method in the hope of finding a detectable, useful pattern between the custom set of customer features selected for segmentation and the target customer feature HighRetention, thus capturing a meaningful association between segments and high- and low-retention customers.
A decision tree algorithm was used to discover statistically
meaningful decision rules among the numerical segmentation features. The
rules are produced by the algorithm by optimizing a (mathematically
defined) criterion which essentially measures how well the rules
classify the observations according to the target feature, in this case,
the binary customer retention feature HighRetention.
The algorithm is so-named because the resulting decision rules can be seen as a partition of the feature space into segments, or alternatively, as a sequence of choices about how to place observations (in this case customers) into segments, and these rules can be easily visualized in a tree diagram.
Several decision trees were fit, and the optimal decision tree was chosen which balanced model complexity and accuracy. The inclusion of model complexity in this choice helps improve the expected ability of the fit to generalize to unseen data, that is, to future customers.
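A sketch of this step using the rpart package; the package choice, seed, and control settings are assumptions, and seg_data is assumed to hold the standardized segmentation features together with HighRetention, one row per customer:

library(rpart)

# Grow a large tree, then prune at the complexity parameter (cp) with the
# lowest cross-validated error, balancing accuracy against tree size
set.seed(123)  # illustrative seed
full_tree <- rpart(HighRetention ~ ., data = seg_data, method = "class",
                   control = rpart.control(cp = 0.001))
best_cp <- full_tree$cptable[which.min(full_tree$cptable[, "xerror"]), "CP"]
pruned_tree <- prune(full_tree, cp = best_cp)

# Each customer's segment is the leaf of the pruned tree it falls into
customers$tree_segment <- factor(pruned_tree$where)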
The leaves of the optimal decision tree were numbered in order from left to right in the tree diagram (note that every leaf contains at least one observation). These leaf numbers are the segment labels, and observations are assigned to these segments based on the tree's decision rules.
SIDEKICK: TO DO - Decision Tree Diagram goes here
Note that the decision rules corresponding to this decision tree involve only a small subset of the segmentation features. The aforementioned complexity criterion, which helps reduce generalization error (so that the fit reflects a true relationship between features and response rather than a spurious artifact of the given dataset), often results in such a “pruned” tree.
For this reason, the resulting decision tree segmentation was seen as perhaps less than ideal, given that potentially useful information contained in the other features wasn’t utilized.
We include here a plot of the segmentation features ranked by importance with respect to their use by the decision trees from which the optimal tree was selected (we omit a more technical discussion of what precisely “importance” means here, and invite the curious reader to research the topic further).
SIDEKICK: TO DO - Feature Importances Plot goes here
Note that the most important features are those seen in the decision rules for the pruned tree.
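A sketch of how such a ranking can be extracted and plotted from an rpart fit, continuing the hypothetical full_tree object from the earlier sketch:

# Variable importance as recorded by rpart during tree construction
importances <- full_tree$variable.importance
barplot(sort(importances), horiz = TRUE, las = 1,
        main = "Segmentation feature importance")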
SIDEKICK: TO DO - Add Distribution Plot
SIDEKICK: TO DO - Add Stacked Bar Plot
In general, unsupervised learning methods are employed to capture novel or unexpected relationships amongst features and observations, that is, without the assumptions implicit in selecting a special “target” to associate the remaining features with. The goal was that at least some of these segments should provide insight about high and low retention customers, that is, that some segments should prove to contain more high or low retention customers than others.
In our case, in order for the method to remain truly unsupervised, we wished to suppress any information related to customer retention in the learning algorithm. Accordingly, the tenure-related features PhoneCoTenure and HighRetention were omitted. The hope was that, in doing so, the resulting segmentation would still contain meaningful information about high and low customer retention, thus adding weight to the discovered segmentation pattern (since it contained no assumptions or information about retention, and yet such associations were found independently).
We mentioned previously that k-means segmenting can only use numerically encoded features. This is because it relies on a notion of distance between points in the feature space, which does not apply to non-numeric features.
Given that the features are measured on vastly different scales, features with large scales could have undue influence on the resulting k-means segmenting. Following standard practice, the segmentation features were therefore standardized to reduce this risk.
Naively, when using k-means to perform segmentation, the practitioner must choose the number k of segments beforehand, but it is better if this choice can be informed by the data. The elbow method exists for this purpose: a plot of k versus a measure of “homogeneity” of the segmented observations is generated, specifically the total within-segment variation.
The elbow method looks for an elbow (kink) in this plot and selects the smallest k at or just before the kink. Similar to decision tree pruning, this choice of k is thought to provide a good balance between model complexity (here, measured by the number of segments) and accuracy (here, measured by how well the segmented observations group together, i.e. how low the variation is within each segment).
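A sketch of how such an elbow plot can be produced, assuming seg_scaled is the matrix of standardized segmentation features:

# Total within-segment variation for a range of candidate k values
set.seed(123)  # illustrative seed
wss <- sapply(1:15, function(k) kmeans(seg_scaled, centers = k, nstart = 25)$tot.withinss)
plot(1:15, wss, type = "b",
     xlab = "k (number of segments)", ylab = "Total within-segment variation")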
Note that the decision tree segmentation also produced 8 segments, which facilitates comparison between the two methods.
SIDEKICK: TO DO - Elbow Plot Goes Here
There is no clear kink, so we are freer to choose a value of \(k\) ourselves; we select \(k=8\). This still balances complexity with the need to detect differences between segments, and it provides more fine-grained information when considering high- vs. low-retention customers than a smaller number of segments would.
Furthermore, a good deal of trial and error justified this choice of k by revealing that it provided a good separation of high- vs. low-retention customers by segment, as well as a good separation of features, as determined by the variance (spread) across segments of the segment means for each feature (more on this below).
Incidentally, this choice of k was only weakly informed by the fact that it somewhat facilitated direct comparison with the decision tree segments, although not explicitly, since the statistical comparison measures (primarily the variance of segment means) would just as easily have worked for different numbers of segments.
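The final segmentation then amounts to a single k-means fit with k = 8 (again, seg_scaled, the seed, and nstart are illustrative):

set.seed(123)
kfit <- kmeans(seg_scaled, centers = 8, nstart = 25)
customers$kmeans_segment <- factor(kfit$cluster)  # segment labels 1-8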
SIDEKICK: TO DO - Add Distribution Plot
SIDEKICK: TO DO - Add Stacked Bar Plot
In order to evaluate and compare the segmentation solutions, we relied on the following evaluation criteria:
As mentioned previously, the decision tree model only used 5/15 \(\approx\) 33% of the total number of features, whereas k-means intrinsically makes use of all features.
Conclusion: Due to the omission by the decision tree model of 10/15 \(\approx\) 67% of the total number of features, k-means has clearly better segment feature space utilization.
To measure the degree to which the segments are well separated from each other, we use two statistical measures of separation: the total and average variance (across segments) of the segment means, which we call “total separation” and “average separation”. Specifically, these are the sum and the average, over all features, of the variance of that feature's means across all 8 segments.
The reasoning for the use of this measure is as follows. Since the mean of each feature is a measure of its “center” (and indeed, the centroid of each segment is the vector of feature means), the variance of the segment means captures how far the within-segment means of each feature are from their overall center.
The sum and average of these variances, taken over all features, capture how far the within-segment feature centers are from their overall center, and thus, ostensibly, from each other.
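A sketch of these two measures, computed from the standardized segmentation features and a vector of segment labels (object names are carried over from the earlier illustrative sketches):

# Per-segment mean of each feature, then the variance of those means across segments
seg_means <- aggregate(as.data.frame(seg_scaled),
                       by = list(segment = customers$kmeans_segment), FUN = mean)
mean_vars <- apply(seg_means[, -1], 2, var)  # one variance per feature
total_separation   <- sum(mean_vars)
average_separation <- mean(mean_vars)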
Conclusion: k-means is clearly better at separating segments, with higher total and average segment separation.
To determine how well the segmentation methods separate high- and low-retention customers, we look at the overall picture provided by the proportion of high-retention customers per segment, for each method. When considering these results, note that there is no natural correspondence between the segment numbers assigned by each method.
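These proportions can be tabulated per method with something like the following (again using the object names from the earlier sketches):

# Percentage of high-retention customers in each segment, per method
round(100 * tapply(customers$HighRetention, customers$kmeans_segment, mean), 1)
round(100 * tapply(customers$HighRetention, customers$tree_segment, mean), 1)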
We notice that k-means appears better able to separate high retention customers, with segments 6 and 8 having roughly 46% and 49%, respectively. The decision tree also has two segments with high percentages of high retention customers, namely 2 and 4, but these are lower, at roughly 40% and 44% respectively.
The decision tree appears better at separating low retention customers than k-means. For k-means, only one segment has a low percentage of high retention customers (hence a high percentage of low retention customers), namely segment 4 at roughly 8%, while for the decision tree there are 4 segments with low percentages of high retention customers, between roughly 8-11%.
Conclusion: These results are somewhat mixed; however, given the potentially higher value of identifying high-retention customers, we give the advantage to k-means.
After careful investigation, it was determined to use the k-means segmentation for the following reasons:

- it makes use of all of the segmentation features, whereas the decision tree used only a small subset;
- it produced better-separated segments, as measured by total and average separation;
- it better separated high-retention customers by segment.
Having selected \(k\)-means segmentation, the resulting eight segments were visualized and investigated, and the results used to build the corresponding customer profiles.
In this report, some general observations are made about the eight segments and their corresponding profiles, and visualizations are provided. The discussion then focuses on the two high-retention segments and one low-retention segment.
SIDEKICK: TO DO - Overview of Results goes here
See appendix B for summary statistics on the customer profiles, namely, the median values of the numerical segmentation features and the mode of categorical features.
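A sketch of how these profile summaries can be produced; the helper mode_of and the vector cat_feats of categorical profiling features are hypothetical names:

# Median of numeric segmentation features per segment
num_profile <- aggregate(customers[seg_feats],
                         by = list(segment = customers$kmeans_segment), FUN = median)

# Mode of categorical profiling features per segment
mode_of <- function(x) names(which.max(table(x)))
cat_profile <- aggregate(customers[cat_feats],
                         by = list(segment = customers$kmeans_segment), FUN = mode_of)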
SIDEKICK: TO DO - Additional Plots Go Here
As mentioned, segments 6 and 8 had much higher retention than the other segments, at about 47% and 49%, respectively, while segment 4 had much lower retention, at approximately 9%.
See the appendix for tables of summary statistics for the segmentation features for both the decision tree and k-means methods.
treeseg_summary_stats
kseg_summary_stats